Assignment #0 - Data and Visualization

Due: Sep 10 (Tuesday) 11:00 pm

Amit Shetty

I. Overview

In this assignment, you will get familiar with tools including Python, NumPy, Matplotlib, pandas, and Jupyter Notebook. Search for two datasets, one for classification and one for regression, from any data source. Each dataset should be large enough: more than 10,000 samples and more than 10 feature values.

II. Linear Algebra and Probability Theory

A. Linear Algebra


The section on linear algebra details the basics of the topic needed to understand machine learning and deep learning algorithms. Relevant aspects of linear algebra are discussed concisely, providing a one-stop shop for the language that most, if not all, AI algorithms use. The main objects are summarised below.

Scalars: A scalar is a single number, which may take different types of values, e.g. real-valued (slope of a line), natural number (number of units), etc.

Vectors: A vector is an array of numbers of the same type (e.g. x ∈ ℝ^n), arranged in order and indexed as x1, x2, ..., xn. It is written as [x1, x2, ..., xn].

Matrices: A matrix is a 2-D array of elements, each element indexed by two numbers. A real-valued matrix A of height m and width n is written A ∈ ℝ^(m x n). The element in the i-th row and j-th column is indexed as A(i, j), and f(A)(i, j) denotes element (i, j) of the matrix computed by applying the function f to A.

Tensors: An array of numbers arranged in a regular grid with a variable number of axes is known as a tensor. The element at coordinates (i, j, k) of a tensor A is written A(i, j, k).

An important matrix operation is the transpose: the transpose of any matrix A, denoted A^T, is its mirror image across the main diagonal. One of the basic operations used in almost all machine learning algorithms is matrix multiplication. For the product of two matrices A (m x n) and B (k x p) to exist, n and k must be equal; the resulting matrix C = AB has shape m x p. Certain useful properties of matrices are:

A(B + C) = AB + AC (distributive law)
A(BC) = (AB)C (associative law)
AB ≠ BA in general (matrix multiplication is not commutative)
(AB)^T = B^T A^T
x^T y = (x^T y)^T = y^T x (since x^T y is a scalar)
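These identities are easy to verify numerically. A quick NumPy sketch (the matrices below are arbitrary examples chosen for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.array([[2.0, 0.0], [0.0, 2.0]])

# Distributive law: A(B + C) = AB + AC
assert np.allclose(A @ (B + C), A @ B + A @ C)
# Associative law: A(BC) = (AB)C
assert np.allclose(A @ (B @ C), (A @ B) @ C)
# Transpose of a product reverses the order: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
# Matrix multiplication is generally NOT commutative
assert not np.allclose(A @ B, B @ A)
```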

The identity matrix I is a matrix which doesn't change a vector when multiplied by it: the entries along its main diagonal are 1 and all other entries are zero. The inverse of a matrix A, denoted A^-1, is defined by the rule: A^-1 A = A A^-1 = I
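NumPy's linalg module can compute inverses directly; a small sanity-check sketch (the matrix is an arbitrary invertible example):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 1.0]])  # an invertible 2x2 matrix (illustrative)
A_inv = np.linalg.inv(A)

# A^-1 A = A A^-1 = I
I = np.eye(2)
assert np.allclose(A_inv @ A, I)
assert np.allclose(A @ A_inv, I)

# In practice, solving Ax = b with np.linalg.solve is preferred over
# forming the inverse explicitly (cheaper and numerically more stable)
b = np.array([3.0, 2.0])
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)
```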

A linear combination of a set of vectors is a weighted sum of those vectors, and the span of the set is the set of all points that can be obtained from linear combinations of the vectors. If none of the vectors is a linear combination of the other vectors, the set of vectors is said to be linearly independent.
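Linear independence can be checked numerically via the rank of the matrix whose rows are the vectors; a small sketch:

```python
import numpy as np

# Two linearly independent vectors in R^2: rank equals the number of vectors
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
assert np.linalg.matrix_rank(np.stack([a, b])) == 2

# c = 2a is a linear combination of a, so {a, c} is linearly dependent
c = 2 * a
assert np.linalg.matrix_rank(np.stack([a, c])) == 1
```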

Norms are used to define the size of a vector. Different types of norms are discussed below: Euclidean norm: this is the L^2 norm, which is heavily used in machine learning; its square can be calculated as x^T x. L^1 norm: used when the difference between zero and non-zero elements is very important. Max norm (also known as the L^infinity norm): the absolute value of the entry with the largest magnitude in the vector. Frobenius norm: used to measure the size of a matrix (analogous to the L^2 norm for vectors).
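All of these are available through np.linalg.norm; a quick sketch (vector values are illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)            # Euclidean (L2) norm
l1 = np.linalg.norm(x, 1)         # L1 norm: sum of absolute values
linf = np.linalg.norm(x, np.inf)  # max norm: largest absolute entry

assert np.isclose(l2, 5.0)
assert np.isclose(l1, 7.0)
assert np.isclose(linf, 4.0)
# The squared L2 norm equals x^T x
assert np.isclose(l2 ** 2, x @ x)

# Frobenius norm of a matrix: like the L2 norm applied to all entries
A = np.array([[1.0, 2.0], [2.0, 4.0]])
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((A ** 2).sum()))
```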

Now that the common matrix operations have been discussed, certain special types of matrices are discussed. Diagonal matrices: these matrices have non-zero entries only along the main diagonal, e.g. the identity matrix I_n. Some of the key features are: a square diagonal matrix can be written as diag(v), where the vector v holds the elements along the main diagonal; multiplying by a diagonal matrix is computationally efficient, since diag(v)x can be calculated by simply scaling each x_i by v_i; and a diagonal matrix need not be square. Symmetric matrix: A = A^T. Unit vector: a vector which has unit norm, i.e. ||x||_2 = 1. Orthogonal vectors: two vectors x and y are orthogonal if x^T y = 0, which means that if both of them have non-zero norm, they are at a 90-degree angle to each other. Orthogonal vectors having unit norm are called orthonormal vectors. Orthogonal matrix: a matrix whose rows (and columns) are mutually orthonormal. Thus A^T A = A A^T = I, which implies A^-1 = A^T. For orthogonal matrices, the inverse is therefore easy and not resource-intensive to compute.
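Both properties are easy to demonstrate; a short sketch using a 2-D rotation matrix as the orthogonal example:

```python
import numpy as np

# Multiplying by diag(v) just scales each component of x by v_i
v = np.array([2.0, 3.0, 4.0])
x = np.array([1.0, 1.0, 1.0])
assert np.allclose(np.diag(v) @ x, v * x)

# A rotation matrix is orthogonal: its inverse is simply its transpose
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)
```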

B. Probability Theory


Probability theory provides a mathematical framework for representing uncertainty. In AI applications, probability theory is used in two ways. First, the laws of probability specify how an AI system should reason. For example, a classification problem (e.g. given an image, classify whether the image is that of a "cat" or a "dog") can be viewed as finding P(Y|X), where X is the input data and Y is the label that we are predicting. We design our algorithms to compute, or approximate when computing the exact value is not feasible, various expressions derived using probability theory. Second, we use probability and statistics to analyse the behaviour of proposed AI systems. For example, we can analyse the accuracy of a classification model by observing how many of its predictions are correct. There are two kinds of probability. Frequentist probability: probability theory was originally developed to analyse the frequencies of events, which are often repeatable, e.g. drawing a certain hand of cards in a poker game. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment infinitely many times, then a proportion p of those repetitions would produce that outcome. This kind of probability, related directly to the rates at which events occur, is called frequentist probability. Bayesian probability: the above reasoning doesn't apply to experiments which are not repeatable, e.g. when a doctor says that a patient has a 40% chance of having the flu, the probability represents a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient doesn't have the flu. This kind of probability, related to qualitative levels of reasoning, is called Bayesian probability. Random variable: a random variable is a variable, e.g. x, that can take on different values (states) randomly.
Since it takes on values randomly, there must be a probability associated with each of those values. Thus, a random variable must be coupled with a probability distribution that specifies how likely each of its states is. There are two types of random variables: discrete, where the number of states is finite or countably infinite; and continuous, which is associated with a real value.

Probability distribution: A probability distribution is a description of how likely a random variable, or a set of random variables, is to take on each of its possible states.

Probability mass function: The probability distribution over discrete random variables is described using a probability mass function (PMF). A probability mass function acting over multiple variables is called a joint probability distribution.

For continuous random variables, the distribution is instead described using a probability density function (PDF).

Sometimes we know the probability distribution over a set of variables and need to find the probability over just a subset of them. The distribution over the subset is called the marginal probability distribution.
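A tiny worked example: when a joint PMF over two discrete variables is stored as a table, the marginals are just row and column sums (the probabilities below are illustrative):

```python
import numpy as np

# Joint PMF of two discrete variables X (rows) and Y (columns)
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])
assert np.isclose(joint.sum(), 1.0)  # a valid PMF sums to 1

# Marginal distributions: sum out the other variable
p_x = joint.sum(axis=1)  # P(X) = sum over y of P(X, y)
p_y = joint.sum(axis=0)  # P(Y) = sum over x of P(x, Y)
assert np.allclose(p_x, [0.3, 0.7])
assert np.allclose(p_y, [0.4, 0.6])
```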

Conditional probability: This is one of the most common types of probability. It is the probability of some event happening given that some other event has already happened, written P(Y|X). It is computed as P(Y|X) = P(X, Y) / P(X), and Bayes' rule, P(X|Y) = P(Y|X) P(X) / P(Y), lets us reverse the direction of conditioning.
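Continuing with a small illustrative joint table: conditioning is division by a marginal, and Bayes' rule follows directly:

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])  # illustrative joint P(X, Y)

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Conditional distribution: P(Y | X = x0) = P(X = x0, Y) / P(X = x0)
p_y_given_x0 = joint[0] / p_x[0]
assert np.allclose(p_y_given_x0, [1 / 3, 2 / 3])
assert np.isclose(p_y_given_x0.sum(), 1.0)  # conditionals are distributions too

# Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y)
p_x0_given_y0 = p_y_given_x0[0] * p_x[0] / p_y[0]
assert np.isclose(p_x0_given_y0, joint[0, 0] / p_y[0])
```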

The expectation, or expected value, of some function f(x) with respect to a probability distribution P(x) is the average, or mean, value that f takes on when x is drawn from P.

Variance gives a measure of how much the values of a random variable x vary around their mean.

Covariance measures how two variables are linearly related, but it also depends on the scale of the variables; correlation normalises out that scale, measuring only the strength of the linear relationship.
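These quantities can be estimated from samples; a quick NumPy sketch with illustrative distribution parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)  # samples drawn from P(x)

# Expectation: the mean value of x when drawn from P
assert abs(x.mean() - 2.0) < 0.05
# Variance: how much x spreads around its mean
assert abs(x.var() - 1.0) < 0.05

# Covariance is scale-dependent; correlation is not
y = 3.0 * x + rng.normal(size=x.size)
cov = np.cov(x, y)[0, 1]
corr = np.corrcoef(x, y)[0, 1]
assert cov > 1.0          # grows with the scale of y
assert 0.9 < corr <= 1.0  # always bounded in [-1, 1]
```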

Certain probability distributions are discussed, such as the Bernoulli distribution (a distribution over a single binary random variable), the multinoulli distribution (similar to the Bernoulli distribution, with the difference being that the discrete random variable can have k different states), and the Gaussian distribution.
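All three can be sampled with NumPy's random generator; a short sketch (the parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Bernoulli: a single binary variable with P(x = 1) = p
p = 0.3
bern = rng.random(100_000) < p
assert abs(bern.mean() - p) < 0.01

# Multinoulli (categorical): k discrete states with given probabilities
probs = [0.2, 0.5, 0.3]
cats = rng.choice(3, size=100_000, p=probs)
assert abs((cats == 1).mean() - 0.5) < 0.01

# Gaussian: parameterised by a mean and a standard deviation
gauss = rng.normal(loc=0.0, scale=2.0, size=100_000)
assert abs(gauss.std() - 2.0) < 0.05
```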

III. Data

1. Classification Dataset

Zomato Restaurant Reviews Dataset


The goal of analysing this dataset is to see which area in the city of Bangalore, India serves the best dishes. This dataset is also used to predict the eating habits of Bangaloreans and to help potential business owners decide where to expand or invest in a restaurant of their own.

Dataset Sourced from:


https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants

Reading and Verifying Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv('zomato.csv')

dataset.head()
Out[1]:
url address name online_order book_table rate votes phone location rest_type dish_liked cuisines approx_cost(for two people) reviews_list menu_item listed_in(type) listed_in(city)
0 https://www.zomato.com/bangalore/jalsa-banasha... 942, 21st Main Road, 2nd Stage, Banashankari, ... Jalsa Yes Yes 4.1/5 775 080 42297555\r\n+91 9743772233 Banashankari Casual Dining Pasta, Lunch Buffet, Masala Papad, Paneer Laja... North Indian, Mughlai, Chinese 800 [('Rated 4.0', 'RATED\n A beautiful place to ... [] Buffet Banashankari
1 https://www.zomato.com/bangalore/spice-elephan... 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... Spice Elephant Yes No 4.1/5 787 080 41714161 Banashankari Casual Dining Momos, Lunch Buffet, Chocolate Nirvana, Thai G... Chinese, North Indian, Thai 800 [('Rated 4.0', 'RATED\n Had been here for din... [] Buffet Banashankari
2 https://www.zomato.com/SanchurroBangalore?cont... 1112, Next to KIMS Medical College, 17th Cross... San Churro Cafe Yes No 3.8/5 918 +91 9663487993 Banashankari Cafe, Casual Dining Churros, Cannelloni, Minestrone Soup, Hot Choc... Cafe, Mexican, Italian 800 [('Rated 3.0', "RATED\n Ambience is not that ... [] Buffet Banashankari
3 https://www.zomato.com/bangalore/addhuri-udupi... 1st Floor, Annakuteera, 3rd Stage, Banashankar... Addhuri Udupi Bhojana No No 3.7/5 88 +91 9620009302 Banashankari Quick Bites Masala Dosa South Indian, North Indian 300 [('Rated 4.0', "RATED\n Great food and proper... [] Buffet Banashankari
4 https://www.zomato.com/bangalore/grand-village... 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... Grand Village No No 3.8/5 166 +91 8026612447\r\n+91 9901210005 Basavanagudi Casual Dining Panipuri, Gol Gappe North Indian, Rajasthani 600 [('Rated 4.0', 'RATED\n Very good restaurant ... [] Buffet Banashankari

The method below shows that the dataset has null values, so data cleaning will be needed. Extra information about the dataset, such as the amount of data we are dealing with, is also reported.

In [2]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
url                            51717 non-null object
address                        51717 non-null object
name                           51717 non-null object
online_order                   51717 non-null object
book_table                     51717 non-null object
rate                           43942 non-null object
votes                          51717 non-null int64
phone                          50509 non-null object
location                       51696 non-null object
rest_type                      51490 non-null object
dish_liked                     23639 non-null object
cuisines                       51672 non-null object
approx_cost(for two people)    51371 non-null object
reviews_list                   51717 non-null object
menu_item                      51717 non-null object
listed_in(type)                51717 non-null object
listed_in(city)                51717 non-null object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB
In [3]:
dataset.shape
Out[3]:
(51717, 17)
In [4]:
# Checking the amount of data to clean
dataset.isna().sum()
Out[4]:
url                                0
address                            0
name                               0
online_order                       0
book_table                         0
rate                            7775
votes                              0
phone                           1208
location                          21
rest_type                        227
dish_liked                     28078
cuisines                          45
approx_cost(for two people)      346
reviews_list                       0
menu_item                          0
listed_in(type)                    0
listed_in(city)                    0
dtype: int64

There are certain columns that will not contribute to our analysis; they will have to be dropped to avoid noise in our dataset.

In [5]:
dataset.drop(columns=['url', 'phone', 'dish_liked', 'address', 'reviews_list', 'menu_item'], inplace=True)

Now that the unnecessary columns have been dropped, we can proceed with removing any rows that contain null values, since such incomplete data would skew our results and introduce unnecessary outliers.

In [6]:
dataset.dropna(how='any',inplace=True)
In [7]:
dataset.isna().sum()
Out[7]:
name                           0
online_order                   0
book_table                     0
rate                           0
votes                          0
location                       0
rest_type                      0
cuisines                       0
approx_cost(for two people)    0
listed_in(type)                0
listed_in(city)                0
dtype: int64

To get an overview of the data, we will look at the best-performing suburbs in the city in terms of the number of restaurants present.

In [8]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(dataset.groupby(['location']).size().nlargest(5))
location
BTM                      4210
Koramangala 5th Block    2358
HSR                      2102
Indiranagar              1889
JP Nagar                 1842
dtype: int64
In [9]:
# Showing the 5 largest areas with market share
plt.figure(figsize=(12,12))
plt.pie(dataset.groupby(['location']).size().nlargest(5), labels=['BTM', 'Koramangala 5th Block', 'HSR', 'Indiranagar', 'JP Nagar'])
plt.show()
In [10]:
# Showing the rest of the locations in Bangalore
plt.figure(figsize=(14,14))
locations=dataset['location'].value_counts()[:15]
sns.barplot(x=locations.values, y=locations.index)
plt.show()

The type of dining service offered best describes what people are looking for in a restaurant, so we plot the number of restaurants of each type.

In [11]:
# Plotting the number of restaurants for each type in the dataset
plt.figure(figsize=(12,12))
sns.countplot(x=dataset['listed_in(type)'])
plt.title('Restaurant food type')
plt.xlabel('Restaurant Type')
plt.ylabel('Number of restaurants')
plt.show()

Showing the number of restaurants that provide table booking and online ordering facilities, as these are in popular demand.

In [12]:
# Checking which restaurants provide an online ordering facility
plt.title('Restaurants delivering online orders?')
sns.countplot(x=dataset['online_order'])
fig=plt.gcf()
fig.set_size_inches(12,12)

We can see that a large number of restaurants have the option of home delivery, suggesting that, to succeed, a business should offer that option.

In [13]:
# Checking which restaurants provide a table booking facility
plt.figure(figsize=(12,12))
plt.title('Do restaurants provide table booking?')
sns.countplot(x=dataset['book_table'])
plt.show()

This is an interesting observation, since most restaurants don't have the facility to book tables. There is scope for improvement here: businesses can use this data to add the ability to book tables, thus increasing their business.

No city in the world is complete without its own variety of cuisines. We will now look into the cuisines that are popular with Bangaloreans.

In [14]:
# Get the popular 20 cuisines of the city
plt.figure(figsize=(12,12))
plt.title('Most popular cuisines in the city')
cuisines=dataset['cuisines'].value_counts()[:20]
sns.barplot(x=cuisines.values, y=cuisines.index)
plt.xlabel('Count')
plt.ylabel('Cuisines')
plt.show()
Key Observations:


Bangalore is a city in the south of India. It is surprising to see that North Indian food is the most popular.

Checking which type of dining service gives restaurants the highest ratings. We will focus on the table booking facility and online ordering, as they were the most popular features in our previous observations.

In [15]:
# Plotting a multivariate bar graph to show how the ability to book tables online affects a restaurant's rating
plt.figure(figsize=(14,14))
sns.countplot(x='rate',hue='book_table',data=dataset)
plt.show()
In [16]:
# Plotting a multivariate bar graph to show how the ability to order online affects a restaurant's rating
plt.figure(figsize=(14,14))
sns.countplot(x='rate',hue='online_order',data=dataset)
plt.show()
Key Observations:


We can see from the above graphs that restaurants have the highest ratings when they offer both table booking and online ordering, for the simple reason that the restaurant types preferred by Bangaloreans are quick bites and dine-outs.

Indians care a lot about where they spend their money, so a distribution plot is used to show how the prices for dining (for two people) are distributed.

In [17]:
dataset['approx_cost(for two people)']=dataset['approx_cost(for two people)'].apply(lambda x: int(x.replace(',','')))
plt.figure(figsize=(14,14))
sns.histplot(dataset['approx_cost(for two people)'], kde=True)
plt.title('Approx Cost distribution for 2 people in rupees')
plt.show()
Key Observation:


As shown in the distribution plot, Bangaloreans don't prefer paying extremely high or extremely low prices for food; the smooth distribution in the plot indicates that.

2. Regression Dataset


Gas sensor array temperature modulation Data Set


The goal of this dataset is to monitor 14 gas sensors in a controlled environment while they are exposed, under humid conditions, to a mixture of carbon monoxide and synthetic air. The readings capture the differences in CO concentration, humidity and temperature, and the sensor data for each of the 14 gas sensors is also recorded.

Dataset sourced from:

https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+temperature+modulation

Reading and verifying data


NOTE: We are limiting our dataset to 10,000 rows, but the amount can be increased nearly 20-fold if needed. For the purposes of this assignment, and to run the visualisations quickly, we will use the smaller dataset.
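For reference, `nrows` caps how much pandas loads, while `chunksize` would let us stream the full file without holding it all in memory. A self-contained sketch using an in-memory stand-in for the CSV (the column names and values here are just placeholders):

```python
import io
import pandas as pd

# Stand-in for the sensor CSV (the real file has far more rows and columns)
csv_text = "Time (s),CO (ppm)\n" + "\n".join(f"{i * 0.3:.1f},0.0" for i in range(10))

# nrows caps how much is loaded into memory
small = pd.read_csv(io.StringIO(csv_text), nrows=5)
assert small.shape[0] == 5

# Streaming alternative: process the file piece by piece
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)
total = sum(len(chunk) for chunk in chunks)
assert total == 10
```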

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.animation import FuncAnimation

# Reading the gas sensor dataset
dataset1 = pd.read_csv('20160930_203718.csv', nrows=10000)

dataset1.head()
Out[18]:
Time (s) CO (ppm) Humidity (%r.h.) Temperature (C) Flow rate (mL/min) Heater voltage (V) R1 (MOhm) R2 (MOhm) R3 (MOhm) R4 (MOhm) R5 (MOhm) R6 (MOhm) R7 (MOhm) R8 (MOhm) R9 (MOhm) R10 (MOhm) R11 (MOhm) R12 (MOhm) R13 (MOhm) R14 (MOhm)
0 0.000 0.0 49.7534 23.7184 233.2737 0.8993 0.2231 0.6365 1.1493 0.8483 1.2534 1.4449 1.9906 1.3303 1.4480 1.9148 3.4651 5.2144 6.5806 8.6385
1 0.309 0.0 55.8400 26.6200 241.6323 0.2112 2.1314 5.3552 9.7569 6.3188 9.4472 10.5769 13.6317 21.9829 16.1902 24.2780 31.1014 34.7193 31.7505 41.9167
2 0.618 0.0 55.8400 26.6200 241.3888 0.2070 10.5318 22.5612 37.2635 17.7848 33.0704 36.3160 42.5746 49.7495 31.7533 57.7289 53.6275 56.9212 47.8255 62.9436
3 0.926 0.0 55.8400 26.6200 241.1461 0.2042 29.5749 49.5111 65.6318 26.1447 58.3847 67.5130 68.0064 59.2824 36.7821 66.0832 66.8349 66.9695 50.3730 64.8363
4 1.234 0.0 55.8400 26.6200 240.9121 0.2030 49.5111 67.0368 77.8317 27.9625 71.7732 79.9474 79.8631 62.5385 39.6271 68.1441 62.0947 49.4614 52.8453 66.8445

Checking for any null values in the dataset, and the datatypes, to ensure we can perform our analysis correctly.

In [19]:
dataset1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
Time (s)              10000 non-null float64
CO (ppm)              10000 non-null float64
Humidity (%r.h.)      10000 non-null float64
Temperature (C)       10000 non-null float64
Flow rate (mL/min)    10000 non-null float64
Heater voltage (V)    10000 non-null float64
R1 (MOhm)             10000 non-null float64
R2 (MOhm)             10000 non-null float64
R3 (MOhm)             10000 non-null float64
R4 (MOhm)             10000 non-null float64
R5 (MOhm)             10000 non-null float64
R6 (MOhm)             10000 non-null float64
R7 (MOhm)             10000 non-null float64
R8 (MOhm)             10000 non-null float64
R9 (MOhm)             10000 non-null float64
R10 (MOhm)            10000 non-null float64
R11 (MOhm)            10000 non-null float64
R12 (MOhm)            10000 non-null float64
R13 (MOhm)            10000 non-null float64
R14 (MOhm)            10000 non-null float64
dtypes: float64(20)
memory usage: 1.5 MB
In [20]:
dataset1.shape
Out[20]:
(10000, 20)

The first part of our observations will monitor four key aspects of our data:

  1. Carbon Monoxide Concentrations
  2. Humidity Level
  3. Temperature
  4. Heater Voltage

PART 1: Carbon Monoxide Concentrations

Observations: We can see from the graph that variations in the CO concentration were introduced over time. This is expected, since CO is our controlled input for testing purposes.
In [21]:
plt.figure(figsize=(30,12))
sns.regplot(x="Time (s)", y="CO (ppm)", data=dataset1)
plt.show()

PART 2: Humidity Level


Humidity levels are interesting, since we can see how changes in the CO concentration and heater voltage affect the humidity level in the chamber.

In [22]:
plt.figure(figsize=(30,12))
sns.scatterplot(x="Time (s)", y="Humidity (%r.h.)", data=dataset1)
plt.show()

PART 3: Temperature


Since the temperature is tightly controlled, it being a key factor in the reactivity inside the gas chamber, we see a fairly stable temperature with little variance.

In [23]:
plt.figure(figsize=(30,12))
sns.scatterplot(x="Time (s)", y="Temperature (C)", data=dataset1)
plt.show()

PART 4: Heater Voltage

NOTE: The heater voltage plot uses only the first 200 values, since the voltage fluctuates constantly and a smaller subset shows the variations better.

In [24]:
plt.figure(figsize=(30,12))
plt.plot(dataset1['Time (s)'][:200], dataset1['Heater voltage (V)'][:200])
plt.show()
Important NOTE:


Since we are dealing with the readings of 14 different gas sensors, writing a visualisation for each would consume a lot of memory and would not give an accurate bird's-eye view. To get one, we use a pair plot, which shows not only the readings of the 14 gas sensors but also the input variables that were used to produce them.


NOTE: The pair plot takes a while to execute despite the 10,000-row limit (about 2 minutes on my system).

In [25]:
sns.pairplot(dataset1)
plt.show()

IV. Conclusions

Doing this challenge has been an interesting ride, and exploring two completely different datasets has been an enjoyable experience. One of the datasets covered a service I was very familiar with, since I had used it myself, and it opened up a whole lot about how business decisions are made using data.

Arriving at an analysis for the regression task proved more challenging than I had hoped. The area of the dataset was unfamiliar to me, but going through the documentation and understanding the data made the task a little easier. Monitoring the input and output variations of each individual sensor using a single graph was a eureka moment for me.

References


  1. Poddar, Himanshu. "Zomato Bangalore Restaurants - Restaurants of Bengaluru". Kaggle. Web Scraped Classification Data, 31 March 2019, https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants
  2. Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25.
  3. Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.
  4. Matplotlib Library, https://matplotlib.org/3.1.1/contents.html
  5. Seaborn Library, https://seaborn.pydata.org/api.html
  6. Pandas Library, https://pandas.pydata.org/pandas-docs/stable/

Grading

DO NOT forget to submit your data! Your notebook is supposed to run without any errors. You don't need to run any ML algorithm. This assignment only asks for reading, visualizing, and writing your observations from the data.

Note: this is a WRITING assignment. Proper writing is REQUIRED. Comments are not considered as writing.

Points  Description
10      Introduction
20      Review
        10  Linear algebra
        10  Probability theory
60      Data
        5   Introduction of data for regression & source
        5   Reading the data
        5   Preprocessing of the data
        10  Visualization of the data
        5   Preliminary observation
        5   Introduction of data for classification & source
        5   Reading the data
        5   Preprocessing of the data
        10  Visualization of the data
        5   Preliminary observation
5       Conclusions
5       References